Objectives

  • Review the mechanics and assumptions of linear regression
  • Identify regression diagnostic tests based on visualization
  • Compare tables vs. graphs for publishing statistical results in academic journals
  • Discuss presenting results in a paper vs. a presentation
library(tidyverse)
library(ggthemes)
library(knitr)
library(broom)
library(stringr)
library(modelr)
library(plotly)

options(digits = 3)
set.seed(1234)
theme_set(theme_minimal())

Linear models and visualization

Linear functional form

Linear models are the simplest functional form to understand. They adopt a generic form

\[Y = \beta_0 + \beta_{1}X\]

where \(y\) is the outcome of interest, \(x\) is the explanatory or predictor variable, and \(\beta_0\) and \(\beta_1\) are parameters that vary to capture different patterns. In algebraic terms, \(\beta_0\) is the intercept and \(\beta_1\) the slope for the linear equation. Given the empirical values you have for \(x\) and \(y\), you generate a fitted model that finds the values for the parameters that best fit the data.

ggplot(sim1, aes(x, y)) + 
  geom_point()

This looks like a linear relationship. We could randomly generate parameters for the formula \(y = \beta_0 + \beta_1 * x\) to try and explain or predict the relationship between \(x\) and \(y\):

models <- tibble(
  a1 = runif(250, -20, 40),
  a2 = runif(250, -5, 5)
)

ggplot(sim1, aes(x, y)) + 
  geom_abline(aes(intercept = a1, slope = a2), data = models, alpha = 1/4) +
  geom_point()

But obviously some parameters are better than others. We need a definition that can be used to differentiate good parameters from bad parameters.

Least squares regression

One approach widely used is called least squares - it means that the overall solution minimizes the sum of the squares of the errors made in the results of every single equation. The errors are simply the difference between the actual values for \(y\) and the predicted values for \(y\) (also known as the residuals).

dist1 <- sim1 %>% 
  mutate(
    dodge = rep(c(-1, 0, 1) / 20, 10),
    x1 = x + dodge,
    pred = 7 + x1 * 1.5
  )

ggplot(dist1, aes(x1, y)) + 
  geom_abline(intercept = 7, slope = 1.5, color = "grey40") +
  geom_point(color = "grey40") +
  geom_linerange(aes(ymin = y, ymax = pred), color = "#3366FF")

To estimate a linear regression model in R, we use the lm() function. The lm() function takes two parameters. The first is a formula specifying the equation to be estimated (lm() translates y ~ x into \(y = \beta_0 + \beta_1 * x\)). The second is the data frame containing the variables:

sim1_mod <- lm(y ~ x, data = sim1)

We can use the summary() function to examine key model components, including parameter estimates, standard errors, and model goodness-of-fit statistics.

summary(sim1_mod)
## 
## Call:
## lm(formula = y ~ x, data = sim1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.147 -1.520  0.133  1.467  4.652 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    4.221      0.869    4.86  4.1e-05 ***
## x              2.052      0.140   14.65  1.2e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.2 on 28 degrees of freedom
## Multiple R-squared:  0.885,  Adjusted R-squared:  0.88 
## F-statistic:  215 on 1 and 28 DF,  p-value: 1.17e-14

The resulting line from this regression model looks like:

dist2 <- sim1 %>%
  add_predictions(sim1_mod) %>%
  mutate(
    dodge = rep(c(-1, 0, 1) / 20, 10),
    x1 = x + dodge
  )

ggplot(dist2, aes(x1, y)) + 
  geom_smooth(method = "lm", color = "grey40") +
  geom_point(color = "grey40") +
  geom_linerange(aes(ymin = y, ymax = pred), color = "#3366FF")

Testing assumptions of linear regression using visualizations

Non-linearity of the data

Linear regression assumes the relationship between the predictors and the response is a straight line. If the true relationship is otherwise, then we cannot generate accurate inferences from the model. Consider the relationship between mpg and horsepower in the ISLR::Auto dataset:

library(ISLR)
data("Auto")

ggplot(Auto, aes(mpg, horsepower)) +
  geom_point() +
  geom_smooth(se = FALSE)

Doesn’t look linear. We could just look at this graph and make that conclusion (a very basic visualization), but what happens once we estimate a model with more than a single predictor? Instead we can use residual plots to identify non-linearity. Recall that the residual is the error for an individual observation \(e_i = y_i - \hat{y}_i\), or the difference between the actual and predicted value for the outcome of interest. Residual plots graph the relationship between the fitted values and the residuals. Ideally there should be no discernable pattern in the graph. If there is a pattern, then this indicates a problem with some aspect of the linear model.

# estimate models
mpg_lin <- lm(horsepower ~ mpg, data = Auto)
mpg_quad <- lm(horsepower ~ poly(mpg, 2, raw = TRUE), data = Auto)

Auto %>%
  add_predictions(mpg_lin) %>%
  add_residuals(mpg_lin) %>%
  ggplot(aes(pred, resid)) +
  geom_point(alpha = .2) +
  geom_smooth(se = FALSE) +
  labs(title = "Residual plot for linear fit")

Auto %>%
  add_predictions(mpg_quad) %>%
  add_residuals(mpg_quad) %>%
  ggplot(aes(pred, resid)) +
  geom_point(alpha = .2) +
  geom_smooth(se = FALSE) +
  labs(title = "Residual plot for quadratic fit")

Non-constant variance of the error terms

Another assumption of linear regression is that the error terms \(\epsilon_i\) have a constant variance, \(\text{Var}(\epsilon_i) = \sigma^2\). This is called homoscedasticity. Estimates of standard errors rely on this assumption. If the variances of the error terms are non-constant (aka heteroscedastic), our estimates of the standard errors will be inaccurate - they will either be inflated or deflated, leading to incorrect inferences about the statistical significance of predictor variables.

We can uncover homo- or heteroscedasticity through the use of the residual plot. Below is data generated from the process:

\[Y = 2 + 3X + \epsilon\]

where \(\epsilon\) is random error distributed normally \(N(0,1)\).

sim_homo <- data_frame(x = runif(1000, 0, 10),
                       y = 2 + 3 * x + rnorm(1000, 0, 1))
sim_homo_mod <- glm(y ~ x, data = sim_homo)

sim_homo %>%
  add_predictions(sim_homo_mod) %>%
  add_residuals(sim_homo_mod) %>%
  ggplot(aes(pred, resid)) +
  geom_point(alpha = .2) +
  geom_hline(yintercept = 0, linetype = 2) +
  geom_quantile(method = "rqss", lambda = 5, quantiles = c(.05, .95)) +
  labs(title = "Homoscedastic variance of error terms",
       x = "Predicted values",
       y = "Residuals")

Compare this to a linear model fit to the data generating process:

\[Y = 2 + 3X + \epsilon\]

where \(\epsilon\) is random error distributed normally \(N(0,\frac{X}{2})\). Note that the variance for the error term of each observation \(\epsilon_i\) is not constant, and is itself a function of \(X\).

sim_hetero <- data_frame(x = runif(1000, 0, 10),
                       y = 2 + 3 * x + rnorm(1000, 0, (x / 2)))
sim_hetero_mod <- glm(y ~ x, data = sim_hetero)

sim_hetero %>%
  add_predictions(sim_hetero_mod) %>%
  add_residuals(sim_hetero_mod) %>%
  ggplot(aes(pred, resid)) +
  geom_point(alpha = .2) +
  geom_hline(yintercept = 0, linetype = 2) +
  geom_quantile(method = "rqss", lambda = 5, quantiles = c(.05, .95)) +
  labs(title = "Heteroscedastic variance of error terms",
       x = "Predicted values",
       y = "Residuals")

We see a distinct funnel-shape to the relationship between the predicted values and the residuals. This is because by assuming the variance is constant, we substantially over or underestimate the actual response \(Y_i\) as \(X_i\) increases.

Outliers

An outlier is an observation for which the predicted value \(\hat{y}_i\) is far from the actual value \(y_i\). Sometimes outliers are simply coding errors in the original dataset, but sometimes they are extreme or unusual values that were not generated by the same data generating process as the remaining dataset. Detecting outliers is the first step to deciding how to handle them (i.e. keep or remove them from the analysis).

We can use a few different visualizations to detect outliers. In a bivariate linear regression model, simply plot the variables against one another and draw the best-fit line:

sim <- data_frame(x = rnorm(20),
                  y = -2 + x + rnorm(20)) %>%
  mutate(y = ifelse(row_number(x) == 15, y + 10, y),
         outlier = ifelse(row_number(x) == 15, TRUE, FALSE))

ggplot(sim, aes(x, y)) +
  geom_point(aes(color = outlier)) +
  geom_smooth(method = "lm", se = FALSE) +
  geom_smooth(data = filter(sim, outlier == FALSE),
              method = "lm", se = FALSE, linetype = 2) +
  scale_color_manual(values = c("black", "red")) +
  theme(legend.position = "none")

Or we can draw another residual plot (either raw or standardized):

sim_lm <- lm(y ~ x, data = sim)

sim %>%
  add_predictions(sim_lm) %>%
  add_residuals(sim_lm) %>%
  ggplot(aes(pred, resid)) +
  geom_point(aes(color = outlier)) +
  labs(x = "Fitted values",
       y = "Residuals") +
  scale_color_manual(values = c("black", "red")) +
  theme(legend.position = "none")

augment(sim_lm) %>%
  left_join(sim) %>%
  ggplot(aes(.fitted, .std.resid)) +
  geom_point(aes(color = outlier)) +
  labs(x = "Fitted values",
       y = "Standardized residuals") +
  scale_color_manual(values = c("black", "red")) +
  theme(legend.position = "none")

High-leverage points

High-leverage points are observations with have strong effects on the coefficient estimates in a linear model. Influence is a function of leverage and discrepancy:

\[\text{Influence} = \text{Leverage} \times \text{Discrepancy}\]

  • Leverage - degree of potential influence on the coefficient estimates that a given observation can (but not necessarily does) have
  • Discrepancy - extent to which an observation is “unusual” or “different” from the rest of the data

Various statistical measures exist for quantifying leverage and discrepancy. Leverage is frequently defined by the leverage statistic (also known as the hat value):

\[h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{i'=1}^{n} (x_{i'} - \bar{x})^2}\]

Residuals are commonly used to measure discrepancy (either raw, standardized, or studentized). Cook’s distance (or Cook’s D) combines these two measures to calculate an observation’s leverage:

\[D_i = \frac{\tilde{u}_i^2}{K} \times \frac{h_i}{1 - h_i}\]

where \(\tilde{u}_i^2\) is the squared standardized residual, \(K\) is the number of parameters in the model, and \(\frac{h_i}{1 - h_i}\) is the hat value.

Where is the visualization? By combining all three variables into a “bubble plot”, we can visualize all three variables simultaneously. For example, here are the results of a basic model on the numer of federal laws struck down by the U.S. Supreme Court in each Congress, based on:

  1. age - the mean age of the members of the Supreme Court
  2. tenure - mean tenure of the members of the Court
  3. unified - a dummy variable indicating whether or not the Congress was controlled by the same party in that period
library(haven)

dahl <- read_dta("data/LittleDahl.dta")
dahl_mod <-lm(nulls ~ age + tenure + unified, data = dahl)

dahl_augment <- dahl %>%
  mutate(hat = hatvalues(dahl_mod),
         student = rstudent(dahl_mod),
         cooksd = cooks.distance(dahl_mod))

ggplot(dahl_augment, aes(hat, student)) +
  geom_point(aes(size = cooksd), shape = 1) +
  geom_text(data = dahl_augment %>%
              arrange(-cooksd) %>%
              slice(1:10),
            aes(label = Congress)) +
  scale_size_continuous(range = c(1, 20)) +
  labs(x = "Leverage",
       y = "Studentized residual") +
  theme(legend.position = "none")

The bubble plot tells us several things:

  • The size of the symbols is proportional to Cook’s D, which is in turn a multiplicative function of the square of the Studentized residuals (Y axis) and the leverage (X axis), so observations farther away from \(Y=0\) and/or have higher values of \(X\) will have larger symbols.
  • The plot tells us whether the large influence of an observation is due to high discrepancy, high leverage, or both
    • The 104th Congress has relatively low leverage but is very discrepant
    • The 74th and 98th Congresses demonstrate both high discrepancy and high leverage

Publishing statistical visualizations

Descriptive statistics

Mosaic plot

# Mosaic plot of happiness and education
library(productplots)
data("happy")

happy <- happy %>%
  na.omit
# contingency table - raw values
happy %>%
  count(happy, sex) %>%
  spread(sex, n) %>%
  kable
happy male female
not too happy 1904 2460
pretty happy 8760 10539
very happy 4833 6327
# contingency table - proportions
with(happy, prop.table(table(happy, sex))) %>%
  kable
male female
not too happy 0.055 0.071
pretty happy 0.252 0.303
very happy 0.139 0.182
# mosaic plot using productplots
prodplot(happy, ~ happy + sex)

# add color
prodplot(happy, ~ happy + sex) +
  aes(fill = happy)

prodplot(happy, ~ happy + marital) +
  aes(fill = happy) +
  theme(legend.position = "none")

# using vcd
library(vcd)
mosaic(~ happy + sex, happy)

Dot plot for summary statistics

Regression results

Session Info

devtools::session_info()
##  setting  value                       
##  version  R version 3.3.3 (2017-03-06)
##  system   x86_64, darwin13.4.0        
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  tz       America/Chicago             
##  date     2017-04-25                  
## 
##  package      * version    date       source                              
##  assertthat     0.1        2013-12-06 CRAN (R 3.3.0)                      
##  backports      1.0.5      2017-01-18 CRAN (R 3.3.2)                      
##  base64enc      0.1-3      2015-07-28 CRAN (R 3.3.0)                      
##  broom        * 0.4.2      2017-02-13 CRAN (R 3.3.2)                      
##  codetools      0.2-15     2016-10-05 CRAN (R 3.3.3)                      
##  colorspace     1.3-2      2016-12-14 CRAN (R 3.3.2)                      
##  DBI            0.6        2017-03-09 CRAN (R 3.3.3)                      
##  devtools       1.12.0     2016-06-24 CRAN (R 3.3.0)                      
##  digest         0.6.12     2017-01-27 CRAN (R 3.3.2)                      
##  dplyr        * 0.5.0      2016-06-24 CRAN (R 3.3.0)                      
##  evaluate       0.10       2016-10-11 CRAN (R 3.3.0)                      
##  expsmooth    * 2.3        2015-04-09 CRAN (R 3.3.0)                      
##  fma          * 2.3        2017-02-12 CRAN (R 3.3.2)                      
##  forcats        0.2.0      2017-01-23 CRAN (R 3.3.2)                      
##  forecast     * 8.0        2017-02-23 CRAN (R 3.3.2)                      
##  foreign        0.8-67     2016-09-13 CRAN (R 3.3.3)                      
##  fpp          * 0.5        2013-03-14 CRAN (R 3.3.0)                      
##  fracdiff       1.4-2      2012-12-02 cran (@1.4-2)                       
##  ggplot2      * 2.2.1      2016-12-30 CRAN (R 3.3.2)                      
##  ggthemes     * 3.4.0      2017-02-19 CRAN (R 3.3.2)                      
##  gtable         0.2.0      2016-02-26 CRAN (R 3.3.0)                      
##  haven        * 1.0.0      2016-09-23 cran (@1.0.0)                       
##  highr          0.6        2016-05-09 CRAN (R 3.3.0)                      
##  hms            0.3        2016-11-22 CRAN (R 3.3.2)                      
##  htmltools      0.3.5      2016-03-21 CRAN (R 3.3.0)                      
##  htmlwidgets    0.8        2016-11-09 CRAN (R 3.3.1)                      
##  httr           1.2.1      2016-07-03 CRAN (R 3.3.0)                      
##  ISLR         * 1.0        2013-06-11 CRAN (R 3.3.0)                      
##  jsonlite       1.3        2017-02-28 CRAN (R 3.3.2)                      
##  knitr        * 1.15.1     2016-11-22 cran (@1.15.1)                      
##  lattice        0.20-34    2016-09-06 CRAN (R 3.3.3)                      
##  lazyeval       0.2.0      2016-06-12 CRAN (R 3.3.0)                      
##  lmtest       * 0.9-35     2017-02-11 CRAN (R 3.3.2)                      
##  lubridate      1.6.0      2016-09-13 CRAN (R 3.3.0)                      
##  magrittr       1.5        2014-11-22 CRAN (R 3.3.0)                      
##  MASS           7.3-45     2016-04-21 CRAN (R 3.3.0)                      
##  Matrix         1.2-8      2017-01-20 CRAN (R 3.3.3)                      
##  MatrixModels * 0.4-1      2015-08-22 CRAN (R 3.3.0)                      
##  memoise        1.0.0      2016-01-29 CRAN (R 3.3.0)                      
##  mnormt         1.5-5      2016-10-15 CRAN (R 3.3.0)                      
##  modelr       * 0.1.0      2016-08-31 CRAN (R 3.3.0)                      
##  munsell        0.4.3      2016-02-13 CRAN (R 3.3.0)                      
##  nlme           3.1-131    2017-02-06 CRAN (R 3.3.3)                      
##  nnet           7.3-12     2016-02-02 CRAN (R 3.3.3)                      
##  plotly       * 4.5.6      2016-11-12 CRAN (R 3.3.2)                      
##  plyr           1.8.4      2016-06-08 CRAN (R 3.3.0)                      
##  productplots * 0.1.1.9000 2017-04-25 Github (hadley/productplots@391f500)
##  psych          1.7.3.21   2017-03-22 CRAN (R 3.3.2)                      
##  purrr        * 0.2.2      2016-06-18 CRAN (R 3.3.0)                      
##  quadprog       1.5-5      2013-04-17 cran (@1.5-5)                       
##  quantreg     * 5.29       2016-09-04 CRAN (R 3.3.0)                      
##  R6             2.2.0      2016-10-05 CRAN (R 3.3.0)                      
##  Rcpp           0.12.10    2017-03-19 cran (@0.12.10)                     
##  readr        * 1.1.0      2017-03-22 cran (@1.1.0)                       
##  readxl         0.1.1      2016-03-28 CRAN (R 3.3.0)                      
##  reshape2       1.4.2      2016-10-22 CRAN (R 3.3.0)                      
##  rmarkdown      1.3        2016-12-21 CRAN (R 3.3.2)                      
##  rprojroot      1.2        2017-01-16 CRAN (R 3.3.2)                      
##  rvest          0.3.2      2016-06-17 CRAN (R 3.3.0)                      
##  scales         0.4.1      2016-11-09 CRAN (R 3.3.1)                      
##  SparseM      * 1.74       2016-11-10 CRAN (R 3.3.2)                      
##  stringi        1.1.2      2016-10-01 CRAN (R 3.3.0)                      
##  stringr      * 1.2.0      2017-02-18 CRAN (R 3.3.2)                      
##  tibble       * 1.2        2016-08-26 cran (@1.2)                         
##  tidyr        * 0.6.1      2017-01-10 CRAN (R 3.3.2)                      
##  tidyverse    * 1.1.1      2017-01-27 CRAN (R 3.3.2)                      
##  timeDate       3012.100   2015-01-23 cran (@3012.10)                     
##  tseries      * 0.10-38    2017-02-20 CRAN (R 3.3.2)                      
##  vcd          * 1.4-3      2016-09-17 CRAN (R 3.3.0)                      
##  viridisLite    0.1.3      2016-03-12 cran (@0.1.3)                       
##  withr          1.0.2      2016-06-20 CRAN (R 3.3.0)                      
##  xml2           1.1.1      2017-01-24 CRAN (R 3.3.2)                      
##  yaml           2.1.14     2016-11-12 cran (@2.1.14)                      
##  zoo          * 1.7-14     2016-12-16 CRAN (R 3.3.2)